[Core] [Bugfix] Refactor block manager subsystem for better testability #3492
Conversation
@simon-mo ready for review
    # Now that the batch has been created, we can assume all blocks in the
    # batch will have been computed before the next scheduling invocation.
    for seq_group in scheduler_outputs.scheduled_seq_groups:
This assumes the model execution won't fail and there's no retry. Is that the general assumption in the engine now?
yep
Add this assumption to the comments? What happens to later preemption?
comment added

> What happens to later preemption?

These lines do not change prefix caching behavior w.r.t. later preemption (either case requires recomputing the blocks).
oh btw @rkooo567 this assumption is not introduced in this PR; the scheduler already assumes that its schedule is implemented by the engine. retries and failures are not yet in scope of vllm
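The assumption under discussion can be illustrated with a toy sketch (hypothetical names, not vLLM's actual API): once the scheduler forms a batch, every block that batch touches is marked computed before the next scheduling invocation, with no retry or failure path.

```python
# Toy sketch of the scheduling assumption under discussion: once a batch
# is formed, all of its blocks are treated as computed before the next
# scheduling invocation (no retry/failure path is modeled).

class ToyBlockManager:
    def __init__(self):
        self.computed_block_ids = set()

    def mark_batch_computed(self, scheduled_seq_groups):
        # Mark every block of every scheduled sequence group as computed.
        for seq_group in scheduled_seq_groups:
            for block_id in seq_group["block_ids"]:
                self.computed_block_ids.add(block_id)


manager = ToyBlockManager()
batch = [{"block_ids": [0, 1]}, {"block_ids": [1, 2]}]
manager.mark_batch_computed(batch)
assert sorted(manager.computed_block_ids) == [0, 1, 2]
```

If model execution could fail and be retried, this eager marking would over-report which blocks hold valid computed state, which is why the question above matters.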
Thanks for the great work. The new structure is looking great. I don't have major issues, so I'm approving. I left some comments, mostly due to my own confusion when reading the code, which I imagine others will run into as well.
tests/core/block/e2e/conftest.py (outdated)
    def generator_outer():
        for llm in generator_inner():
            yield llm
            del llm
Why do we need another level of wrapper? Would it work without it? If not, please comment why.
I see the usage of the llm generator below, but I'm still confused, since we are only yielding one llm instance.
oh good catch, not necessary
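To see why the wrapper is redundant, here is a self-contained sketch (the `"llm-instance"` placeholder stands in for a real LLM object): `generator_outer` only re-yields what `generator_inner` produces, so the two are equivalent.

```python
# Sketch of the pattern from the diff: generator_outer only re-yields
# what generator_inner produces, so the extra wrapper adds nothing.

def generator_inner():
    yield "llm-instance"  # placeholder for a real LLM object

def generator_outer():
    for llm in generator_inner():
        yield llm
        del llm  # drops only the local reference; the consumer may still hold one

# The wrapper is observably equivalent to using generator_inner() directly:
assert list(generator_outer()) == list(generator_inner())
```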
vllm/core/block/interfaces.py (outdated)
    pass


    class DeviceAwareBlockAllocator(ABC):
I do wonder whether this can "inherit" BlockAllocator somehow, so we are only re-defining allocate_mutable and allocate_immutable.
good idea
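A minimal sketch of the suggested inheritance (hypothetical signatures, not the actual vLLM interfaces): the device-aware interface subclasses `BlockAllocator` and re-declares only the two allocation methods with an extra device parameter, inheriting everything else.

```python
# Hypothetical sketch of the inheritance suggested above; signatures are
# illustrative, not vLLM's real interfaces.
from abc import ABC, abstractmethod
from enum import Enum
from typing import List


class Device(Enum):
    GPU = 0
    CPU = 1


class BlockAllocator(ABC):
    @abstractmethod
    def allocate_mutable(self):
        ...

    @abstractmethod
    def allocate_immutable(self, token_ids: List[int]):
        ...


class DeviceAwareBlockAllocator(BlockAllocator):
    # Only the allocation methods gain a device argument; any other
    # abstract methods would be inherited from BlockAllocator unchanged.
    @abstractmethod
    def allocate_mutable(self, device: Device):
        ...

    @abstractmethod
    def allocate_immutable(self, token_ids: List[int], device: Device):
        ...


assert issubclass(DeviceAwareBlockAllocator, BlockAllocator)
```

One caveat with this shape: the overridden methods change arity, so the subclass is not substitutable for the base in the Liskov sense; that trade-off is why keeping them as separate ABCs is also defensible.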
    @@ -196,6 +196,8 @@ def lora_int_id(self) -> int:
        return self.lora_request.lora_int_id if self.lora_request else 0

    def hash_of_block(self, logical_idx: int) -> int:
        # TODO This can produce incorrect hash when block size > prompt size
😨
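The TODO above can be demonstrated with a toy content hash (illustrative only, not vLLM's actual hashing): when the prompt is shorter than the block size, the block is not yet full, so hashing it produces a value that changes once the sequence grows into the rest of the block.

```python
# Toy demonstration of the TODO: hashing a logical block that is not yet
# full yields a hash that is wrong as a content hash of the full block.
BLOCK_SIZE = 4

def toy_hash_of_block(token_ids, logical_idx):
    # Hash everything up to the end of block `logical_idx`. If the
    # sequence hasn't filled that block yet, this hashes partial content.
    end = (logical_idx + 1) * BLOCK_SIZE
    return hash(tuple(token_ids[:end]))

# Prompt shorter than the block size: block 0 is only partially filled,
# yet it already receives a "full block" hash.
short_prompt = [1, 2]
h_partial = toy_hash_of_block(short_prompt, 0)

# Once the sequence grows, the same logical block hashes differently,
# so the earlier value was not a stable content hash.
grown = [1, 2, 3, 4]
h_full = toy_hash_of_block(grown, 0)
assert h_partial != h_full
```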
vllm/core/block_manager_v2.py (outdated)
    def access_all_blocks_in_seq(self, seq, now):
        pass
What's the TODO here?
added comment (basically #3667)
    def mark_blocks_as_computed(self, seq_group: SequenceGroup):
        # We ignore the sequence group as it's not necessary. After the batch is
        # formed by the scheduler, we do not need to mark blocks from individual
        # sequence groups as computed -- all blocks in the batch can be marked
        # as computed.
is this a no-op then?
kinda -- the plumbing is here to validate the design, but prefix caching isn't tested e2e (need #3667)
    caching.

    Args:
        create_block (Block.Factory): A factory function for creating new
Can we parameterize the type somehow, so we constrain it to NaiveBlock? It only works there.
hm, open to more typing, but here we need it to return an unspecialized block. the prefix caching allocator will specify to the naive block allocator that it should construct prefix caching blocks instead of the default naive block.
I will add a comment
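A sketch of the factory indirection being described (toy types, hypothetical names): the naive allocator stays generic over a `create_block` callable, and the prefix-caching layer passes its own factory so the naive allocator constructs prefix-caching blocks instead of the default.

```python
# Toy sketch of the create_block factory indirection described above.
from typing import Callable, List, Optional


class Block:
    def __init__(self, token_ids: List[int]):
        self.token_ids = token_ids


class PrefixCachingBlock(Block):
    def __init__(self, token_ids: List[int]):
        super().__init__(token_ids)
        self.content_hash: Optional[int] = None


# The factory type deliberately returns the unspecialized Block,
# which is why constraining it to a single concrete type is awkward.
BlockFactory = Callable[[List[int]], Block]


class ToyNaiveAllocator:
    def __init__(self, create_block: BlockFactory = Block):
        self._create_block = create_block

    def allocate(self, token_ids: List[int]) -> Block:
        # The allocator builds whatever the factory returns.
        return self._create_block(token_ids)


naive = ToyNaiveAllocator()
assert type(naive.allocate([1, 2])) is Block

# The prefix caching layer reuses the naive allocator with its own factory:
caching = ToyNaiveAllocator(create_block=PrefixCachingBlock)
assert type(caching.allocate([1, 2])) is PrefixCachingBlock
```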
    def free(self, block: Block) -> None:
        block_id = block.block_id
        block.block_id = None
Why is this needed? Please add a comment.
added
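A sketch of why clearing `block_id` on free helps (toy classes, not the real implementation): any later use of the freed block sees `None` and fails fast, instead of silently reading an id that may since have been handed to another sequence.

```python
# Toy sketch: clearing block_id on free turns use-after-free into a
# loud failure instead of silent aliasing of a reallocated block.
class Block:
    def __init__(self, block_id: int):
        self.block_id = block_id


class ToyAllocator:
    def __init__(self):
        self.free_ids = []

    def free(self, block: Block) -> None:
        block_id = block.block_id
        # Clear the id so any later use of this freed Block object fails
        # fast, rather than aliasing a physical block that may have been
        # reallocated to a different sequence.
        block.block_id = None
        self.free_ids.append(block_id)


allocator = ToyAllocator()
block = Block(block_id=7)
allocator.free(block)
assert block.block_id is None     # stale handle is now visibly invalid
assert allocator.free_ids == [7]  # the physical id returns to the pool
```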
    Refcount = int


    class NaiveBlockAllocator(BlockAllocator):
I think this should be called CoWBlockAllocator btw because it's actually quite powerful
I thought about this -- the conflict is that the prefix caching allocator also supports CoW (it isn't unique to this allocator).
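A minimal sketch of the refcount-based copy-on-write both allocators share (hypothetical names and structure): forking a sequence bumps a block's refcount, and appending to a block with more than one owner forces a copy into a fresh block first.

```python
# Toy sketch of refcount-based copy-on-write, as shared by both the
# naive and the prefix caching allocators. Names are illustrative.
class ToyCowAllocator:
    def __init__(self, num_blocks: int):
        self.refcounts = {i: 0 for i in range(num_blocks)}

    def allocate(self) -> int:
        for block_id, rc in self.refcounts.items():
            if rc == 0:
                self.refcounts[block_id] = 1
                return block_id
        raise RuntimeError("out of blocks")

    def fork(self, block_id: int) -> None:
        # Sharing a block just bumps its refcount.
        self.refcounts[block_id] += 1

    def cow_block_if_not_appendable(self, block_id: int) -> int:
        # A block may be appended in place only if it has a single owner.
        if self.refcounts[block_id] == 1:
            return block_id
        # Otherwise: drop our reference and copy into a fresh block.
        self.refcounts[block_id] -= 1
        return self.allocate()


alloc = ToyCowAllocator(num_blocks=4)
src = alloc.allocate()
alloc.fork(src)                      # two sequences now share `src`
dst = alloc.cow_block_if_not_appendable(src)
assert dst != src                    # the write triggered a copy
assert alloc.refcounts[src] == 1     # the other sequence keeps `src`
```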
Thanks for the review 🙏. Applying feedback.
This PR refactors the block manager / allocator / prefix caching subsystem to make things easier to test. Concretely, it establishes the following layers (diagrams in https://docs.google.com/document/d/1ipAypZYfZgloP_08sLi1Z2BADHMauWScbT_H1F0ThCQ/edit?pli=1):
The interfaces are such that the BlockAllocator can be implemented via a NaiveBlockAllocator (e.g. no prefix caching) or a PrefixCachingBlockAllocator. The value of this approach is in its separation of concerns and consequent improved testability.
This PR is missing the following features before the BlockManagerV2 can become the default:

- computed bit

Testing
Design
Key APIs
Important interactions
Swapping GPU/CPU
Copy-on-write
Prefix caching promotion